Music Intelligence with Spotify

Spotify is the largest on-demand music streaming service, in large part due to its adoption of new technologies and its application of big data. Spotify's competitive advantage is its ability to transform streaming into a personalized experience built on machine learning and natural-language-processing algorithms. Spotify uses recommendation systems to engage and retain listeners, increase customer satisfaction, and grow revenue.

Music data gives insights into a listener's psyche. Use cases for music data continue to grow: it's used to curate targeted marketing campaigns, plan artist tour routes, and track shifts in musical taste. Spotify's data collection has been referred to as "emotional surveillance" or "music intelligence".

The data science objective of this project is to understand the listener's profile. We'll analyze the mood of user playlists and recommend songs using a content-based filtering system.

In this analysis, we'll be focusing on several key questions.

  • What is the distribution of audio features in a large dataset of songs?
  • How do playlist audio features differ across users?
  • What songs are similar to one another – can they be classified into genres?
  • What songs can we recommend to users using a content-based recommender system?
  • What other playlists is a user's music most similar to?
!pip install spotipy
!pip install dash
!pip install jupyter_dash
from dash import Dash, dcc, html, Input, Output
from jupyter_dash import JupyterDash
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.figure_factory import create_2d_density
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial.distance import cdist

We'll perform data collection through the Spotify Web API. Using a client ID and secret key acquired through Spotify's developer dashboard, we can access playlist data from links sent to us by friends.

cid = '4819f4999fb246ee975d7bccacc5162b'
secret = '41dce8cfa93041bc94b04454c9b5ed8b'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
stored_playlists = {
    'huseinH_playlist': 'https://open.spotify.com/playlist/787g5agmc3mkhMP2yuuUq4?si=I9XqOCjHRpeTp0OATzR1pA',
    'cristinC_playlist': 'https://open.spotify.com/playlist/1hJGuIjNJ5FqnzpHPLhiBq?si=FyEfvFXiRSetkuAKyyF13Q',
    'haleyS_playlist': 'https://open.spotify.com/playlist/3fyXTn20ryd5Qi8tVgbrQE?si=l5Ke7OWiRpGbUONlWdj86Q',
    'samH_playlist': 'https://open.spotify.com/playlist/7uT4eRwuzhWRoH5pySFvJs?si=HuIZ_hmGTqaC5SsEdcTHRg',
    'jennaS_playlist': 'https://open.spotify.com/playlist/0NPdcEC2d23aRUGVX3lWg4?si=b-iboKJyQB6DEuWQ7aAu7A',
    'natalieH_playlist': 'https://open.spotify.com/playlist/6RBxXZ5qbNQ1uLoVAa7NbX?si=IWr68P2DRziv_lzODVNTAg',
    'mehulG_playlist': 'https://open.spotify.com/playlist/1UnaV6jlGUbOcQfAy1sX5t?si=H4Ard-QHQ4u4t7VzUQoHYg',
    'architM_playlist': 'https://open.spotify.com/playlist/1CJ0iKHy0CsEwwgIBlFEGQ?si=H5iLkQJlRKe7KzBRbUcVKw',
    'callyL_playlist': 'https://open.spotify.com/playlist/1vYuLAvb7R0OLJrEtcyDLD?si=lyyPqXKoS422fMLsm8DYNQ',
    'top_songs_global': 'https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f',
    'get_turnt': 'https://open.spotify.com/playlist/37i9dQZF1DWY4xHQp97fN6',
    'mellow_bars': 'https://open.spotify.com/playlist/37i9dQZF1DWT6MhXz0jw61',
    'new_music_friday': 'https://open.spotify.com/playlist/37i9dQZF1DX4JAvHpjipBk',
    'rap_caviar': 'https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd',
    'pop_all_day': 'https://open.spotify.com/playlist/37i9dQZF1DXarRysLJmuju',
    'songs_to_sing_in_the_car': 'https://open.spotify.com/playlist/37i9dQZF1DWWMOmoXKqHTD',
    'level_up': 'https://open.spotify.com/playlist/5Ea3GbZtAjQ4wEHfnbH3Bn?si=8PxpPcgYT5qS5sYo-IDlJQ'
}

Spotipy

Using Spotipy, a Python wrapper for the Spotify Web API, we pulled the songs and corresponding audio features from each playlist. We stored the data in a dictionary of pandas dataframes.

for name in stored_playlists:

    link = stored_playlists[name]
    rows = []

    for song in sp.playlist_tracks(link)['items']:
        track = song['track']
        af = sp.audio_features(track['uri'])[0]

        rows.append({'track name': track['name'],
                     'artist': track['artists'][0]['name'],
                     'popularity': track['popularity'],
                     'danceability': af['danceability'], 'energy': af['energy'],
                     'key': af['key'], 'loudness': af['loudness'],
                     'mode': af['mode'], 'speechiness': af['speechiness'],
                     'acousticness': af['acousticness'],
                     'instrumentalness': af['instrumentalness'],
                     'liveness': af['liveness'], 'valence': af['valence'],
                     'tempo': af['tempo']})

    # Build each frame in one shot; DataFrame.append was removed in pandas 2.0.
    df = pd.DataFrame(rows)
    df = df.astype({'track name': 'string', 'artist': 'string'})
    numeric_cols = df.columns.drop(['track name', 'artist'])
    df[numeric_cols] = df[numeric_cols].astype('float64')

    stored_playlists[name] = df

We're going to create a 'master' dataframe with all of the playlist data to use for analysis. We also use this combined data to fit the MinMaxScaler, so that the features whose natural range falls outside [0, 1] (popularity, loudness, key, tempo) are rescaled onto the same [0, 1] scale as the other audio features.
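To see why we fit the scaler on the combined data rather than per playlist, here is a minimal sketch using synthetic tempo values (toy numbers, not our playlist data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy tempos (BPM): the combined data covers the full observed range.
combined = np.array([[60.0], [90.0], [120.0], [180.0]])
one_playlist = np.array([[90.0], [120.0]])

sc = MinMaxScaler(feature_range=(0, 1))
sc.fit(combined)  # learns min=60, max=180 from ALL playlists

# Values are now comparable across playlists: (90-60)/120 and (120-60)/120.
print(sc.transform(one_playlist).ravel())  # [0.25 0.5 ]
```

Fitting on the single playlist instead would stretch its own values to span [0, 1], making them incomparable with other playlists scaled the same way.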

all_playlists = pd.concat(
    [v.assign(user=k) for k, v in stored_playlists.items()],
    ignore_index=True)
all_playlists['user'] = all_playlists['user'].astype('string')

# Fit the scaler on the combined data, then apply the same scaling per playlist.
sc = MinMaxScaler(feature_range=(0, 1))
scaled = ['popularity', 'loudness', 'key', 'tempo']
all_playlists[scaled] = sc.fit_transform(all_playlists[scaled])
for k, v in stored_playlists.items():
    v[scaled] = sc.transform(v[scaled])
all_playlists.describe()
popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo
count 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000 1233.000000
mean 0.616010 0.680898 0.627210 0.469291 0.727058 0.614761 0.152149 0.210386 0.012284 0.174648 0.469537 0.470498
std 0.186951 0.146556 0.159475 0.331254 0.104926 0.486849 0.133760 0.226800 0.078862 0.127465 0.217964 0.191726
min 0.000000 0.208000 0.078400 0.000000 0.000000 0.000000 0.023600 0.000079 0.000000 0.026400 0.034800 0.000000
25% 0.520000 0.580000 0.521000 0.090909 0.664250 0.000000 0.045100 0.030600 0.000000 0.099000 0.300000 0.315820
50% 0.650000 0.696000 0.628000 0.454545 0.739124 1.000000 0.090900 0.122000 0.000000 0.122000 0.460000 0.468501
75% 0.750000 0.793000 0.741000 0.727273 0.799637 1.000000 0.244000 0.334000 0.000046 0.204000 0.619000 0.612993
max 1.000000 0.971000 0.989000 1.000000 1.000000 1.000000 0.737000 0.978000 0.954000 0.925000 0.973000 1.000000

Audio feature box plots

Spotify provides these audio features to help describe the underlying character of each track. We've created box plots to gauge the distribution of each feature across all of the playlists.

From this plot, we can see that the playlists skew toward high popularity, energy, danceability, and loudness, and low acousticness, liveness, and speechiness. Tempo and valence are spread across a wide range.

columns = ['popularity','danceability','energy','valence','tempo','acousticness','liveness','speechiness','loudness']
fig = go.Figure()
for col in columns:
  fig.add_trace(go.Box(y=all_playlists[col], name=col, boxpoints='all',
                       text=all_playlists['track name']))

fig.update_layout(
    title_text = "All Playlists Audio Features",
    showlegend = True,
    paper_bgcolor = "white",
    width = 900
)
fig.show(renderer='notebook')

Context:

  • Popularity: The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
  • Danceability: How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • Energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (happy, cheerful, euphoric), while tracks with low valence sound more negative (sad, depressed, angry).
  • Tempo: The overall estimated tempo of a track in beats per minute (BPM). Tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
  • Speechiness: Detects the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, values below 0.33 represent non-speech-like tracks.
  • Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).
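The speechiness thresholds above lend themselves to a simple labeling helper. This is a sketch of our own (the function name and label strings are ours; only the 0.33/0.66 cutoffs come from Spotify's feature documentation):

```python
def speechiness_label(value):
    """Bucket a track using Spotify's documented speechiness cutoffs."""
    if value > 0.66:
        return "spoken word"   # e.g. podcast, audiobook, talk show
    elif value >= 0.33:
        return "mixed"         # speech layered over music, such as rap
    return "music"             # little to no detected speech

print(speechiness_label(0.45))  # mixed
print(speechiness_label(0.05))  # music
```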

User Radar Charts

In order to compare playlists to one another, we've aggregated the audio features and displayed them on radar charts.
The charts show the average audio features for each user's playlist and give insight into the type of music they like. From these charts, we can roughly see which users like similar music, and how music tastes differ.

graph = ['cristinC_playlist','haleyS_playlist','mehulG_playlist','samH_playlist','architM_playlist','top_songs_global','level_up','callyL_playlist']
fig = make_subplots(rows=4, cols=2, specs=[[{'type': 'polar'}]*2]*4)

colors = ["","mediumseagreen","darkorange","mediumpurple","magenta","limegreen","gold","blue","green","red","yellow"]
placement = [[],[1,1],[1,2],[2,1],[2,2],[3,1],[3,2],[4,1],[4,2]]

for n,name in enumerate(graph,start=1):

    temp = pd.DataFrame(stored_playlists[name],columns=['popularity','danceability','energy','loudness','valence','tempo','speechiness','acousticness'])    
    fig.add_trace(
        go.Scatterpolar(
            r=temp.mean().values,
            theta=temp.columns,
            fill='toself',
            name= name, 
            fillcolor = colors[n], opacity=0.3, line=dict(color=colors[n])
            ), row=placement[n][0], col=placement[n][1])
    
polar_axes = {('polar' if i == 1 else f'polar{i}'): dict(radialaxis=dict(visible=False, range=[0, 1]))
              for i in range(1, 9)}
fig.update_layout(paper_bgcolor="white", **polar_axes)

fig.update_layout(height=1000, width=1000, title_text="Playlist Audio Features",showlegend= True)
fig.show(renderer='notebook')

We can view the same data on a single radar chart, where the legend lets us filter to specific users.

This chart makes it easier to compare two playlists at once, and it also gives a good overview of all the playlists stacked on top of each other.

fig = go.Figure()

graph = ['cristinC_playlist','haleyS_playlist','samH_playlist','jennaS_playlist','natalieH_playlist','top_songs_global','mehulG_playlist','huseinH_playlist','level_up','callyL_playlist','get_turnt','mellow_bars','new_music_friday','rap_caviar','pop_all_day','songs_to_sing_in_the_car']
colors = ["","maroon","rosybrown","mediumseagreen","darkorange","mediumpurple","magenta","limegreen","gold","blue","yellow","red","purple","gold","green","orange","pink","cyan"]

for n,name in enumerate(graph,start=1):

    temp = pd.DataFrame(stored_playlists[name],columns=['popularity','danceability','energy','loudness','valence','tempo','speechiness','acousticness'])    
    fig.add_trace(
        go.Scatterpolargl(
            r=temp.mean().values,
            theta=temp.columns,
            name= name, 
            marker=dict(size=15, color=colors[n])))

fig.update_traces(mode="markers", marker=dict(line_color='white', opacity=0.7))

fig.update_layout(
    title = "Playlist Audio Features",
    font_size = 15,
    showlegend = True,
    paper_bgcolor = "white"
)
# This figure has a single polar subplot, so only 'polar' applies here.
fig.update_layout(polar=dict(radialaxis=dict(visible=False, range=[0, 1])))
fig.show(renderer='notebook')

Correlation Graphs

Visualize the correlations between different audio features.

scatter_df = all_playlists[['popularity','danceability','energy','loudness','valence','tempo','speechiness','acousticness']]
fig = px.scatter_matrix(scatter_df)
fig.show(renderer='notebook')

There's a strong positive correlation between 'energy' and 'loudness', and a strong negative correlation between 'energy' and 'acousticness'. We'll look at the energy-loudness relationship in more detail below.
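Such relationships can be quantified with pandas' `DataFrame.corr`, which computes pairwise Pearson correlations. A minimal sketch on synthetic feature columns (toy numbers, not the real playlist data):

```python
import pandas as pd

# Toy stand-in for the playlist features: loudness rises with energy,
# acousticness falls with it.
toy = pd.DataFrame({
    'energy':       [0.2, 0.4, 0.6, 0.8, 1.0],
    'loudness':     [0.3, 0.45, 0.6, 0.8, 0.95],
    'acousticness': [0.9, 0.7, 0.5, 0.2, 0.05],
})
corr = toy.corr()
print(corr.loc['energy', 'loudness'])      # close to +1
print(corr.loc['energy', 'acousticness'])  # close to -1
```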

fig = px.density_heatmap(all_playlists, x='energy', y='loudness', nbinsx=30, nbinsy=30, color_continuous_scale="YlGnBu")
fig.show(renderer='notebook')

Clustering song features

We can cluster songs by their audio features using k-means and visualize the clusters with t-SNE. These clusters of songs are, in effect, genres of our own making.

This tells us which songs are most alike. The genres aren't explicitly labeled 'pop' or 'rap'; rather, each is a collection of songs with similar audio features, and those songs may well fall into the same real-world genre.
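The dashboard below lets us pick the number of clusters by hand. A more systematic option, which we don't use in this notebook, is the silhouette score; here is a sketch on synthetic blobs standing in for the scaled audio features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, so k=3 should score best.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = tighter, better-separated

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy data
```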

Dash is Plotly's framework for building interactive dashboards. In this case, we're using Dash to visualize the songs with different numbers of clusters and in different dimensions.

First, we create dropdowns to configure the graph functionality.

Then, in the callback that updates the graph, we configure how the graph is displayed. Because changing a dropdown value changes the clustering or the dimensionality, we recompute the k-means and t-SNE pipelines inside the callback.

import warnings
warnings.filterwarnings('ignore')
app = JupyterDash(__name__)

app.layout = html.Div([
    html.Div([
        html.Div([dcc.Dropdown(
            options={'2': '2 dimensions','3': '3 dimensions'},
              value='2',
              id='dimensions'
            )]), #style={'width': '48%', 'display': 'inline-block'}),
        html.Div([dcc.Dropdown(
            options={'2': '2 clusters','3': '3 clusters','4': '4 clusters','5':'5 clusters','6':'6 clusters'},
              value='3',
              id='clusters'
            )]), #style={'width': '48%', 'display': 'inline-block'}),
    dcc.Graph(id='indicator-graphic'),]),])

@app.callback(Output('indicator-graphic', 'figure'),[Input('dimensions', 'value'),Input('clusters', 'value')])

def update_graph(dimensions, clusters):
  dimensions = dimensions.split(' ')[0]
  clusters = clusters.split(' ')[0]

  cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=int(clusters)))])
  X = all_playlists.select_dtypes(np.number)
  cluster_pipeline.fit(X)
  all_playlists['cluster'] = cluster_pipeline.predict(X)

  tsne_pipeline = Pipeline([('scaler', StandardScaler()), ('tsne', TSNE(n_components=int(dimensions), verbose= False))])
  genre_embedding = tsne_pipeline.fit_transform(X)

  if int(dimensions) == 2:
    projection = pd.DataFrame(columns=['x', 'y'], data=genre_embedding)
    projection.insert(0,'cluster', all_playlists['cluster'].tolist(), True) 
    projection.insert(0,'name', all_playlists['track name'].tolist(), True) 
    fig = go.Figure(data=[go.Scatter(x = projection['x'],y = projection['y'],\
    mode = 'markers',marker=dict(color=projection['cluster'],colorscale='Viridis'))])
  elif int(dimensions) == 3:
    projection = pd.DataFrame(columns=['x', 'y','z'], data=genre_embedding)
    projection.insert(0,'cluster', all_playlists['cluster'].tolist(), True) 
    projection.insert(0,'name', all_playlists['track name'].tolist(), True) 
    fig = go.Figure(data=[go.Scatter3d(x = projection['x'],y = projection['y'],z=projection['z'],\
    mode = 'markers',marker=dict(color=projection['cluster'],colorscale='Viridis'))])
  fig.update_traces(hovertemplate=projection['name'])
  return fig

app.run_server(mode='inline',debug=True)

Content-based recommendation system

Within our corpus of songs, we can recommend tracks to a user based on what their friends listen to.

First, we average each user's audio-feature vectors to get a playlist 'center'. Then, using cosine distance, we find the songs closest to that center.
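This nearest-songs step can be sketched on toy vectors (synthetic numbers, not our playlists) with the same `cdist` call the recommender uses:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy "user center" and three candidate songs, two audio features each.
user_center = np.array([[0.8, 0.2]])
songs = np.array([
    [0.9, 0.1],   # similar direction -> small cosine distance
    [0.1, 0.9],   # opposite emphasis -> large cosine distance
    [0.8, 0.2],   # identical direction -> distance ~0
])

distances = cdist(user_center, songs, 'cosine')[0]
ranking = np.argsort(distances)  # closest songs first
print(ranking)  # [2 0 1]
```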

users = ['cristinC_playlist','haleyS_playlist','callyL_playlist','samH_playlist','jennaS_playlist','natalieH_playlist','architM_playlist','mehulG_playlist','huseinH_playlist',\
         'top_songs_global','get_turnt','mellow_bars','new_music_friday','rap_caviar','pop_all_day','songs_to_sing_in_the_car','level_up']
columns=['popularity','danceability','energy','loudness','valence','tempo','speechiness','acousticness','instrumentalness','liveness','mode','key']
user_means = np.vstack([
    all_playlists.loc[all_playlists['user'] == user, columns].mean().to_numpy(dtype=float)
    for user in users])

cluster_df = pd.DataFrame(user_means)
cluster_df['user']=users
def get_recs(user_center, user_name, all_playlists):
  current_songs = set(all_playlists.loc[all_playlists['user'] == user_name]['track name'])

  distances = cdist(user_center.reshape(1, -1), all_playlists[columns], 'cosine')
  index = list(np.argsort(distances)[:, :50][0])
  recs = all_playlists.iloc[index]

  # De-duplicate while preserving the similarity ranking (set() would scramble
  # the order), and skip songs already in the user's playlist.
  song_recs = [s for s in recs['track name'] if s not in current_songs]
  return list(dict.fromkeys(song_recs))[:10]

for n,user in enumerate(user_means):
    if n > 8:
      break
    recs = get_recs(user,users[n],all_playlists)
    user = users[n].split('_')[0]
    print(user)
    print(recs)
    print()
cristinC
['You and Tequila (feat. Grace Potter)', 'Hurt Somebody (With Julia Michaels)', 'Atlanta Girls', 'goosebumps', 'Hate The Other Side (with Marshmello & The Kid Laroi)', 'Ruin My Life', 'Sunshine (feat. Lil Wayne & Childish Gambino)', 'Leave Me Alone', 'good 4 u', 'Un Coco']

haleyS
['feeling this bad (v2)', 'Yo Yo', 'Hate The Other Side (with Marshmello & The Kid Laroi)', 'Sunshine (feat. Lil Wayne & Childish Gambino)', 'No Gold Teeth', 'Leave Me Alone', 'We Are Never Ever Getting Back Together', 'Blank Space', 'Alright', 'Little Dance']

callyL
['You and Tequila (feat. Grace Potter)', 'Hurt Somebody (With Julia Michaels)', 'Heavy (with Lil Uzi Vert)', 'abcdefu', 'goosebumps', 'Atlanta Girls', 'Leave Me Alone', 'good 4 u', 'Alright', 'Un Coco']

samH
['CHAMPAGNE', 'Yo Yo', 'Famous', 'feeling this bad (v2)', 'goosebumps', 'Just like Heaven', 'Everybody Talks', 'Leave Me Alone', 'Meteorite', 'Satellite']

jennaS
['You and Tequila (feat. Grace Potter)', 'Hurt Somebody (With Julia Michaels)', 'Heavy (with Lil Uzi Vert)', 'abcdefu', 'goosebumps', 'fingers crossed', 'Hate The Other Side (with Marshmello & The Kid Laroi)', 'good 4 u', 'KEEP IT UP', 'Without Me']

natalieH
['CHAMPAGNE', 'Famous', 'goosebumps', 'Everybody Talks', 'Nice For What', 'Leave Me Alone', 'Acapulco', 'Blank Space', 'Alright', 'Shout Out to My Ex']

architM
['Hurt Somebody (With Julia Michaels)', 'Heavy (with Lil Uzi Vert)', 'feeling this bad (v2)', 'Yo Yo', 'Leave Me Alone', 'Acapulco', 'Blank Space', 'Alright', 'Un Coco', 'Sky']

mehulG
['Heavy (with Lil Uzi Vert)', 'feeling this bad (v2)', 'abcdefu', 'Acapulco', 'Blank Space', 'Shout Out to My Ex', 'Un Coco', 'Sky', 'El Apagón', 'PUFFIN ON ZOOTIEZ']

huseinH
['You and Tequila (feat. Grace Potter)', 'Hurt Somebody (With Julia Michaels)', 'feeling this bad (v2)', 'Yo Yo', 'Atlanta Girls', 'abcdefu', 'Hate The Other Side (with Marshmello & The Kid Laroi)', 'Sunshine (feat. Lil Wayne & Childish Gambino)', 'Leave Me Alone', 'Blank Space']

For conciseness, we output only the top 10 recommended songs for each user. By construction, these songs' audio features closely match the average audio features of the user's playlist.

The best 'true' measure of whether these recommendations are accurate is to input your own playlist and listen to the results! A fun property of this recommender is that it only surfaces songs your friends already listen to.

Playlist Similarity

In addition to finding new tracks based on playlist features, we can also figure out what playlists are most similar to one another.

def similar_playlists(user_center, user_name, user_means, users):

  distances = cdist(user_center.reshape(1, -1), user_means, 'cosine')
  index = list(np.argsort(distances)[:, :4][0])
  recs = [users[x] for x in index]

  # Drop the user's own playlist (always the nearest), keeping the distance order.
  return [r for r in recs if r != user_name][:3]

for n,user in enumerate(user_means):
    similar_users = similar_playlists(user,users[n],user_means,users)
    print(users[n])
    print(similar_users)
    print()
cristinC_playlist
['jennaS_playlist', 'new_music_friday', 'huseinH_playlist']

haleyS_playlist
['new_music_friday', 'songs_to_sing_in_the_car', 'huseinH_playlist']

callyL_playlist
['architM_playlist', 'cristinC_playlist', 'mehulG_playlist']

samH_playlist
['natalieH_playlist', 'new_music_friday', 'songs_to_sing_in_the_car']

jennaS_playlist
['cristinC_playlist', 'callyL_playlist', 'huseinH_playlist']

natalieH_playlist
['pop_all_day', 'songs_to_sing_in_the_car', 'samH_playlist']

architM_playlist
['level_up', 'mehulG_playlist', 'get_turnt']

mehulG_playlist
['architM_playlist', 'rap_caviar', 'level_up']

huseinH_playlist
['haleyS_playlist', 'jennaS_playlist', 'cristinC_playlist']

top_songs_global
['rap_caviar', 'pop_all_day', 'mehulG_playlist']

get_turnt
['level_up', 'rap_caviar', 'mehulG_playlist']

mellow_bars
['architM_playlist', 'level_up', 'get_turnt']

new_music_friday
['haleyS_playlist', 'pop_all_day', 'songs_to_sing_in_the_car']

rap_caviar
['level_up', 'mehulG_playlist', 'get_turnt']

pop_all_day
['new_music_friday', 'songs_to_sing_in_the_car', 'mehulG_playlist']

songs_to_sing_in_the_car
['haleyS_playlist', 'pop_all_day', 'new_music_friday']

level_up
['architM_playlist', 'mehulG_playlist', 'get_turnt']

Above, we output the three closest playlists to each playlist. We can also group similar playlists with k-means clustering.

cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=6))])
X = cluster_df.select_dtypes(np.number)
cluster_pipeline.fit(X)
cluster_df['cluster'] = cluster_pipeline.predict(X)

tsne_pipeline = Pipeline([('scaler', StandardScaler()), ('tsne', TSNE(n_components=3, verbose=False))])
users_embedding = tsne_pipeline.fit_transform(X)

projection = pd.DataFrame(columns=['x', 'y','z'], data=users_embedding)
projection.insert(0,'cluster', cluster_df['cluster'].tolist(), True) 
projection.insert(0,'name', cluster_df['user'].tolist(), True) 
fig = go.Figure(data=[go.Scatter3d(x = projection['x'],y = projection['y'],z=projection['z'],\
mode = 'markers',marker=dict(color=projection['cluster'],colorscale='Viridis'))])

fig.update_traces(hovertemplate=projection['name'])
fig.show(renderer='notebook')
cluster_df[['user','cluster']].sort_values(by='cluster')
user cluster
16 level_up 0
13 rap_caviar 0
10 get_turnt 0
7 mehulG_playlist 0
6 architM_playlist 0
0 cristinC_playlist 1
4 jennaS_playlist 1
2 callyL_playlist 1
8 huseinH_playlist 1
1 haleyS_playlist 2
3 samH_playlist 3
5 natalieH_playlist 3
15 songs_to_sing_in_the_car 4
9 top_songs_global 4
11 mellow_bars 4
14 pop_all_day 4
12 new_music_friday 5

Looking at these clusters, we can gauge how well they match intuition:

  • Cluster 0: rap
  • Cluster 4: pop

The users' playlists also fall into clusters alongside other users with similar taste in music.

Conclusion

Using a Spotify listener's playlist and pulling the audio features for every track, we can learn a lot about a Spotify user and their friends.

In this report, we've explored how the audio features are distributed, how songs cluster together, and how audio features can be used to recommend new songs. We've also compared users by the aggregate audio features of their playlists. This Spotify music analysis is interactive and insightful, and it can be refit to new data simply by adding the link to a new playlist.